• pyforest: Provides lazy imports for commonly used data science libraries, so you don't need to manually import them. It's intended to simplify code and reduce clutter - in other words: libraries pandas, matplotlib, seaborn, numpy and sklearn in one code line. (https://pyforest.readthedocs.io/en/latest/)
  • math: A built-in Python module that provides mathematical functions and constants, such as sin, cos, and pi. (https://docs.python.org/3/library/math.html)
  • sidetable: Extends pandas' DataFrame with an additional method called stb.freq_table, which returns a frequency table of the specified column(s), used here for missing values. (https://github.com/great-expectations/sidetable)
  • plotly: Provides interactive data visualization tools, including charts and graphs. (https://plotly.com/python/)
  • plotly.express: A high-level interface for creating interactive data visualizations with Plotly (my favorite one for visualization).(https://plotly.com/python/plotly-express/)
  • plotly.graph_objs: A module that provides low-level tools for creating Plotly charts and graphs, such as traces and layouts.(https://plotly.com/python/graph-objects/)
  • plotly.subplots: A module that provides tools for creating subplots with Plotly.(https://plotly.com/python/subplots/)
  • matplotlib.patches: A module that provides classes for creating graphical shapes, such as rectangles and circles, in Matplotlib.(https://matplotlib.org/stable/api/patches_api.html)
  • scipy.stats: A module that provides statistical functions and tools, such as ttest_ind for performing a two-sample t-test.(https://docs.scipy.org/doc/scipy/reference/stats.html)
  • pycountry: Provides ISO databases for country-related data, such as country names, codes, and subdivisions.(https://pypi.org/project/pycountry/)
  • pycountry_convert: Provides tools for converting between country-related codes and data, such as ISO codes and country names.(https://pypi.org/project/pycountry-convert/)
  • warnings: A built-in Python module that provides tools for issuing and handling warnings.(https://docs.python.org/3/library/warnings.html)
  • pygWalker: Provides tools to visualize the simulation output and analyze the simulation results.(https://github.com/Kanaries/pygwalker)

In [1]:
from pyforest import *
import math
import sidetable
import plotly as pl
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import matplotlib.patches as mpatches
import matplotlib
from scipy import stats
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import bar_chart_race as bcr
import pygwalker as pyg
import pycountry
import pycountry_convert as pc
import warnings

World-wide Population:¶

In [2]:
warnings.filterwarnings('ignore')
In [3]:
dat = pd.read_csv(r"C:\Users\elon2\OneDrive\Desktop\programming\Data Analysis Projects\Data\worldPopulationData.csv")
dat.head(3)
Out[3]:
Series Name Series Code Country Name Country Code 1960 [YR1960] 1961 [YR1961] 1962 [YR1962] 1963 [YR1963] 1964 [YR1964] 1965 [YR1965] ... 2012 [YR2012] 2013 [YR2013] 2014 [YR2014] 2015 [YR2015] 2016 [YR2016] 2017 [YR2017] 2018 [YR2018] 2019 [YR2019] 2020 [YR2020] 2021 [YR2021]
0 Population, total SP.POP.TOTL Israel ISR 2114020 2185000 2293000 2379000 2475000 2563000 ... 7910500 8059500 8215700 8380100 8546000 8713300 8882800 9054000 9215100 9364000
1 Population, total SP.POP.TOTL Afghanistan AFG 8622466 8790140 8969047 9157465 9355514 9565147 ... 30466479 31541209 32716210 33753499 34636207 35643418 36686784 37769499 38972230 40099462
2 Population, total SP.POP.TOTL Albania ALB 1608800 1659800 1711319 1762621 1814135 1864791 ... 2900401 2895092 2889104 2880703 2876101 2873457 2866376 2854191 2837849 2811666

3 rows × 66 columns

In [4]:
dat.tail(3)
Out[4]:
Series Name Series Code Country Name Country Code 1960 [YR1960] 1961 [YR1961] 1962 [YR1962] 1963 [YR1963] 1964 [YR1964] 1965 [YR1965] ... 2012 [YR2012] 2013 [YR2013] 2014 [YR2014] 2015 [YR2015] 2016 [YR2016] 2017 [YR2017] 2018 [YR2018] 2019 [YR2019] 2020 [YR2020] 2021 [YR2021]
268 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
269 Data from database: World Development Indicators NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
270 Last Updated: 12/22/2022 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 66 columns

Cleaning and organizing:¶

In [5]:
yrs = [dat.columns[x][:4] for x in range(len(dat.columns)) if x > 3]
for i in range(len(yrs)):
    yrs[i] = int(yrs[i])
cols = list(dat.columns[:4])
for x in range(4):    
    if ' ' in cols[x]:
        cols[x] = cols[x].replace(" ", "")     # removing spaces for comfort
cols.extend(yrs)
dat.columns = cols
In [6]:
dat = dat.dropna(subset = ['CountryName'])
dat = dat.drop(dat.index[252])
In [7]:
dat.dtypes
Out[7]:
SeriesName     object
SeriesCode     object
CountryName    object
CountryCode    object
1960           object
                ...  
2017           object
2018           object
2019           object
2020           object
2021           object
Length: 66, dtype: object
In [8]:
dat['CountryName'].head(220).unique
Out[8]:
<bound method Series.unique of 0                           Israel
1                      Afghanistan
2                          Albania
3                          Algeria
4                   American Samoa
                  ...             
215                         Zambia
216                       Zimbabwe
217    Africa Eastern and Southern
218     Africa Western and Central
219                     Arab World
Name: CountryName, Length: 220, dtype: object>

* Countries by their names are up to line 217¶

In [9]:
countries = dat.iloc[:217,]
countries = countries.replace('..',0)

* Missing values¶

In [10]:
countries.stb.missing().sum()
Out[10]:
missing        0.0
total      14322.0
percent        0.0
dtype: float64

The increase in the world's population over the years :¶

In [11]:
wrld = dat[(dat.CountryName == 'World')]
wrld
Out[11]:
SeriesName SeriesCode CountryName CountryCode 1960 1961 1962 1963 1964 1965 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
265 Population, total SP.POP.TOTL World WLD 3031564839 3072510552 3126934725 3193508879 3260517816 3328284623 ... 7140895722 7229184551 7317508753 7404910892 7491934113 7578157615 7661776338 7742681934 7820981524 7888408686

1 rows × 66 columns

In [12]:
# Clean and sort:
forvis = wrld.iloc[:,4::5]
vals = pd.DataFrame(sorted(wrld.iloc[:,4::5].values)).values
forvis
Out[12]:
1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015 2020
265 3031564839 3328284623 3690306927 4070114517 4442440474 4850160867 5293517142 5726801833 6144322697 6552571570 6969631901 7404910892 7820981524
In [13]:
plt.figure(figsize = (20,5))
ax = sns.lineplot(x = forvis.columns, y = vals[::-1].flatten(), marker='h', markersize = 12)
plt.gca().invert_yaxis()
plt.title('World Population Increase Over The Years :\n',size = 25,fontweight = "bold")
# plt.text(2021, 5000, 'Significant population increase', fontsize=12, fontweight='bold')
plt.xlabel('Year', size = 20)
plt.ylabel('log(Population)',size = 20)
plt.xlim([1965,2021])
ax.set_yscale("log")
plt.show()

Cheking Repetitiveness with Regression Models:¶

In [14]:
nwrld = wrld.iloc[:,4:]
nwrld = nwrld.T.reset_index().rename(columns = {'index':'Year', 265:'Population'})
nwrld.Population = pd.to_numeric(nwrld.Population)
regmodel = DecisionTreeRegressor(max_depth=5)
X = nwrld[['Year']]
y = nwrld[['Population']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regmodel.fit(X_train,y_train)
pred = regmodel.predict(X)

actual_values = np.array(y['Population'])
predicted_values = pred
fig, ax = plt.subplots(figsize=(20, 6))
ax.plot(pd.to_numeric(actual_values), color='black', label='Actual')
ax.plot(pred, color='orange', label='Predicted')
ax.set_xlabel('Time (years)',size = 15)
ax.set_ylabel('Population(Billions people)',size = 15)
ax.legend()
plt.title("Model's Quality\n", size = 20)
plt.show()
In [15]:
pred2 = regmodel.predict(X_test)
mse = mean_squared_error(y_test, pred2)
r2 = r2_score(y_test, pred2)
mae = mean_absolute_error(y_test, pred2)
pd.DataFrame([{'MSE': mse, 'MAE' : mae, 'r^2':r2, 'r' : r2**0.5}])
Out[15]:
MSE MAE r^2 r
0 9.307788e+15 9.406073e+07 0.994547 0.99727
In [16]:
print(f"The model is not capable of predicting values above the highest known value, for example - prediction for world's population in 2500 is {int(regmodel.predict([[2500]]))},     equal to 2021")
The model is not capable of predicting values above the highest known value, for example - prediction for world's population in 2500 is 7854695105,     equal to 2021

Scipy's stats can be useful for resolving the problem¶

In [17]:
x = nwrld.Year
y = nwrld.Population
slope, intercept, r, p, std_err = stats.linregress(x,y)
pd.DataFrame([{'Slope': slope, 'Intercept' : intercept, 'r^2':r**2, 'r' : r, 'std':std_err}])
Out[17]:
Slope Intercept r^2 r std
0 8.176666e+07 -1.573898e+11 0.99904 0.99952 327273.129817
prediction for 2050:¶
In [18]:
def lr_forecast(x):
    # Our regression equation :
    return slope * x + intercept
lr_model = list(map(lr_forecast, x))
fig, ax = plt.subplots(figsize=(20, 6))
plt.title("Regression Model's Quality\n", size = 20)
plt.scatter(x, y, color = 'black', label = 'Actual' )
plt.plot(x, lr_model, color = 'orange', label = 'Predicted')
plt.legend()
plt.show()
In [19]:
lr_forecast(2500)
Out[19]:
47026813387.46753

Annual Grow Rate:¶

In [20]:
last = len(nwrld) - 1
# In the end of 2022 - World's population was 8 Billion.
nwrld['one_year_diff'] = [nwrld.Population[c+1] - nwrld.Population[c] if c < last else 8000000000-nwrld.Population[last] for c in range(len(nwrld.Population))]
nwrld
Out[20]:
Year Population one_year_diff
0 1960 3031564839 40945713
1 1961 3072510552 54424173
2 1962 3126934725 66574154
3 1963 3193508879 67008937
4 1964 3260517816 67766807
... ... ... ...
57 2017 7578157615 83618723
58 2018 7661776338 80905596
59 2019 7742681934 78299590
60 2020 7820981524 67427162
61 2021 7888408686 111591314

62 rows × 3 columns

In [21]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 7))
sns.barplot(x='Year', y='one_year_diff', data=nwrld, ax=ax1, capsize=0.2)
# ax1.set_xlabel('Year', fontsize=14)
ax1.set_ylabel('Delta', fontsize=18)
ax1.set_title('Bars\n', fontsize=20)
ax1.set_xticklabels([])

sns.regplot(x='Year', y='one_year_diff', data=nwrld, ax=ax2, scatter_kws={'s': 20, 'cmap': 'rainbow'})
ax2.set_xlabel('Year', fontsize=14)
ax2.set_ylabel('')
ax2.set_title('Regression\n', fontsize=20)
x = nwrld['Year']
y = nwrld['one_year_diff']
coefficients = np.polyfit(x, y, 1)
m,b = coefficients[0],coefficients[1]
equation = f"y = {m:.2f}x + {b:.2f}"
ax2.annotate(equation, xy=(0.05, 0.95), xycoords='axes fraction', fontsize=14, ha='left', va='top')
plt.tight_layout()
plt.show()
In [22]:
fig = go.Figure()
hover_template1 = "<b>Year:</b> %{x}<br><b>Delta * 75 :</b> %{y:.2f}"
hover_template2 = "<b>Year:</b> %{x}<br><b>Population :</b> %{y:.2f}"

# multiply by 75 the diffs for comparable vizualization
fig.add_trace(go.Scatter(x=nwrld.Year, y=nwrld.one_year_diff*75, line=dict(color="purple", width=3), hovertemplate=hover_template1, name="Difference"))
fig.add_trace(go.Scatter(x=nwrld.Year, y=nwrld.Population, line=dict(color="red", width=3), hovertemplate=hover_template2, name="Population"))
fig.update_layout(title=dict(text="Population VS Difference over Time", font=dict(size=20)), xaxis=dict(title="Year", tickfont=dict(size=14)),
    yaxis=dict(title="Years Diff", tickfont=dict(size=14)), template="plotly_white", height=500, width=950)
fig.show()
  • notice the scaling on y-axis values.
  • A recurring trend in each of the charts is sharp decrease of grow difference value on 2020, probably due to the corona virus.
  • The difference on 2020 is like in 1960 !
  • As we can see, despite the covid - the world population has never stopped from growing.

Population Differences By country (1690-2021):¶

Try to change 'k' to any value (in the range) to see the according number of countries at any side of the scale:

In [23]:
countriesT = countries.T
countriesT.columns = countries['CountryName'].values
countriesT['West Bank and Gaza'] = countriesT['West Bank and Gaza'].replace(0,countriesT['West Bank and Gaza'][1990])
countriesT = countriesT.iloc[4:,]
In [24]:
# 1 <= k <= 216
k = 10
In [25]:
def total_growth(country):
    return int(countriesT[country].values[-1]) - int(countriesT[country].values[0])
In [26]:
dictionary = {}
for country in countriesT.columns:
    dictionary[country] = total_growth(country)
sortedict = dict(sorted(dictionary.items(), key = lambda item: item[1], reverse=False))
dict_lowest = dict(list(sortedict.items())[:k])
dict_highest = dict(list(sortedict.items())[-k:])

Plotting the Findings:¶

In [27]:
fig, (ax1, ax2) = plt.subplots(figsize=(20, 10), ncols=2)
ax1.bar(x=list(dict_lowest.keys()), height=dict_lowest.values(), color = 'orange')
ax2.bar(x=list(dict_highest.keys()), height=dict_highest.values(), color = 'blue')

ax1.set_xlabel('Country', size = 25)
ax1.set_ylabel('Population', size = 25)
ax1.set_title('"k" Countries with the Lowest Population Difference(millions)\n', size = 20)
ax2.set_xlabel('Country', size = 25)
ax2.set_title('"k" Countries with the Highest Population Difference(millions)\n', size = 20)

ax1.set_xticks(range(len(dict_lowest)))
ax1.set_xticklabels(dict_lowest.keys(), rotation=45, size = 15)
ax2.set_xticks(range(len(dict_highest)))
ax2.set_xticklabels(dict_highest.keys(), rotation=45, size = 15)
plt.tight_layout()

for i, (key, value) in enumerate(dict_lowest.items()):
    ax1.text(i, value + 1000, f"{value/1000000:.2f}M", ha='center', size = 15)

for i, (key, value) in enumerate(dict_highest.items()):
    ax2.text(i, value + 1000000, f"{value/1000000:.2f}M", ha='center', size = 15)

Population changes in relation to the existing situation ;¶

Try to change 'n' to any value (in the range) to see the according number of countries at any side of the scale:

In [28]:
#   1 <= n <= 216
n = 20
In [29]:
def relative_growth(country):
    return total_growth(country) / int(countriesT[country].values[-1])
dictionary2 = {}
for country in countriesT.columns:
    dictionary2[country] = relative_growth(country)
sortedict2 = dict(sorted(dictionary2.items(), key = lambda item: item[1], reverse=False))
dict_lowest2 = dict(list(sortedict2.items())[:n])
dict_highest2 = dict(list(sortedict2.items())[-n:])
dd = dict_lowest2 | dict_highest2
df = countries[countries.CountryName.isin(dd.keys())]
df['diff'] = (pd.to_numeric(df[2021]) - pd.to_numeric(df[1960]))
df['relative'] = pd.to_numeric(df[1960])/(pd.to_numeric(df[2021]))
df = df.sort_values('relative')
df = df.loc[:,['CountryName','CountryCode','diff','relative']]
In [30]:
fig1 = px.choropleth(df,locations='CountryCode', color='relative',
                    title='Population changes in relation to the existing situation - Relative:', hover_data=['CountryName'],
                    color_continuous_scale=px.colors.sequential.ice, height=500, width=800)
fig2 = px.choropleth(df,locations='CountryCode', color='diff',
                    title='Population changes in relation to the existing situation- Quantitative:',
                    hover_data=['CountryName'], color_continuous_scale=px.colors.sequential.YlOrRd, height=500, width=800)
fig1.update_layout(geo=dict(showframe=True, showcoastlines=False,projection_type='orthographic'))
fig2.update_layout(geo=dict(showframe=True, showcoastlines=False,projection_type='orthographic'))
fig1.show()
fig2.show()
  • For both estimators : Darker color -> Bigger change.
  • The highest rate of population relative growth belongs to the countries of the Middle East & Africe, while the lowest belongs to Europian countries.
  • Regarding the numeric changes - there's no noticeable unambiguous trend exept of slow growth among the population of Europe.

We can read any data about any country and cross it with it's population to investigate phenomena and interesting things, for example:¶

Population and Land area (present):¶

In [31]:
area = pd.read_csv(r"C:\Users\elon2\OneDrive\Desktop\programming\Data Analysis Projects\Data\land-area-km.csv")
In [32]:
countries[2021] = pd.to_numeric(countries[2021])
areanow = area[area.Year == 2021].rename(columns = {'Land area (sq. km)':'Area'})
popnow = countries[['CountryName','CountryCode',2021]]
popnow.rename(columns = {'CountryCode':'Code',2021:'population'}, inplace = True)
comb1 = areanow.merge(popnow, on = 'Code')[['CountryName','Code','population','Area']]
comb1.sample(10)
Out[32]:
CountryName Code population Area
19 Belize BLZ 400031 2.281000e+04
144 Norway NOR 5408320 3.651078e+05
184 Sudan SDN 45657202 1.849234e+06
37 Central African Republic CAF 5457154 6.229800e+05
50 Czechia CZE 10505772 7.720000e+04
130 Mozambique MOZ 32077072 7.863800e+05
39 Chile CHL 19493184 7.435320e+05
116 Malaysia MYS 33573874 3.285500e+05
133 Nauru NRU 12511 2.000000e+01
43 Congo, Rep. COG 5835806 3.415000e+05
  • pd.merge() function is similar to JOIN statement on SQL.
In [33]:
corr = comb1.corr()
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap="coolwarm", ax=ax)
ax.set_title("Population & Land Area: Medium Connection\n", fontsize=16)
plt.show()
In [34]:
comb1['PeoplePerKM2'] = comb1['population'] / comb1['Area']
comb1 = comb1.sort_values('PeoplePerKM2')
comb1[comb1.CountryName == 'Israel']
density = comb1[['CountryName','Code','PeoplePerKM2']]
density.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 215 entries, 76 to 113
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   CountryName   215 non-null    object 
 1   Code          215 non-null    object 
 2   PeoplePerKM2  215 non-null    float64
dtypes: float64(1), object(2)
memory usage: 6.7+ KB

Using pycountry library :¶

In [35]:
# Create dictionary with all countries to match each for its continent.
region_map = {}
for country in pycountry.countries:
    try:
        country_alpha_2 = country.alpha_2
        country_region = pc.country_alpha2_to_continent_code(country_alpha_2)
        region_map[country.name] = country_region
    except KeyError:
        pass
density['Region'] = density['CountryName'].map(region_map)
density = density.dropna(subset=["Region"])
new_labels = {"NA":"North America", "SA": "South America", "EU": "Europe", "AS":"Asia","AF": "Africa", "OC":"Oceania"}
density.Region = density.Region.map(new_labels)
density.sample(5)
Out[35]:
CountryName Code PeoplePerKM2 Region
153 Philippines PHL 381.930872 Asia
29 Bulgaria BGR 63.354302 Europe
81 Guinea-Bissau GNB 73.283108 Africa
108 Liberia LBR 53.918355 Africa
196 Tunisia TUN 78.932454 Africa
  • Got terrestrial density list by units of (people/km**2)
In [36]:
fig = px.scatter(density, x="CountryName", y="PeoplePerKM2", log_y=True, title="Population density Scale", color="Region")
fig.update_layout(width=1000, height=700, title_x=0.5, title_font=dict(size=32),
    xaxis=dict(title='Country', showgrid=True, gridwidth=2, gridcolor='lightgray', tickangle=45, tickfont=dict(size=13)),
    yaxis=dict(title='Population density (people/km²)', showgrid=True, gridwidth=1, gridcolor='lightgray', tickfont=dict(size=14)),
    legend_title="Continent",legend=dict(yanchor="top", y=1.5, xanchor="left", x=0.05))
fig.add_annotation(x=56, y=2.63, text="Israel is here", showarrow=True, font=dict(size=14))
fig.update_traces(marker=dict(size=10, opacity=0.8))
fig.show()

Focusing continent's mean values¶

In [37]:
meanDen = density.groupby("Region").mean().reset_index()
fig = px.bar(meanDen, x="Region", y="PeoplePerKM2", color="Region", title="Mean Population Density by Region")
fig.update_layout(width=950, height=600, title_x=0.5, title_font=dict(size=32),
    xaxis=dict(title='Region', showgrid=True, gridwidth=2, gridcolor='lightgray', tickangle=45, tickfont=dict(size=13)),
    yaxis=dict(title='Mean Population Density (people/km²)', showgrid=True, gridwidth=1, gridcolor='lightgray', tickfont=dict(size=14)),
    legend_title="Continent",legend=dict(yanchor="top", y=1.5, xanchor="left", x=-0.05))
fig.show()

Glance at the other part of the data:¶

In [38]:
rest = dat.iloc[:,2:].tail(48)
rest = rest.set_index(rest.CountryName).iloc[:,2:].T.reset_index()
rest = rest.rename(columns={'index':'Year'})
rest = rest.set_index(rest.Year)
rest = rest.iloc[:,1:]
rest = rest.astype(float)
rest.columns
Out[38]:
Index(['Africa Eastern and Southern', 'Africa Western and Central',
       'Arab World', 'Caribbean small states',
       'Central Europe and the Baltics', 'Early-demographic dividend',
       'East Asia & Pacific', 'East Asia & Pacific (excluding high income)',
       'East Asia & Pacific (IDA & IBRD countries)', 'Euro area',
       'Europe & Central Asia',
       'Europe & Central Asia (excluding high income)',
       'Europe & Central Asia (IDA & IBRD countries)', 'European Union',
       'Fragile and conflict affected situations',
       'Heavily indebted poor countries (HIPC)', 'High income', 'IBRD only',
       'IDA & IBRD total', 'IDA blend', 'IDA only', 'IDA total',
       'Late-demographic dividend', 'Latin America & Caribbean',
       'Latin America & Caribbean (excluding high income)',
       'Latin America & the Caribbean (IDA & IBRD countries)',
       'Least developed countries: UN classification', 'Low & middle income',
       'Low income', 'Lower middle income', 'Middle East & North Africa',
       'Middle East & North Africa (excluding high income)',
       'Middle East & North Africa (IDA & IBRD countries)', 'Middle income',
       'North America', 'OECD members', 'Other small states',
       'Pacific island small states', 'Post-demographic dividend',
       'Pre-demographic dividend', 'Small states', 'South Asia',
       'South Asia (IDA & IBRD)', 'Sub-Saharan Africa',
       'Sub-Saharan Africa (excluding high income)',
       'Sub-Saharan Africa (IDA & IBRD countries)', 'Upper middle income',
       'World'],
      dtype='object', name='CountryName')

The relationship between the population growth rate and income levels:¶

In [39]:
incomes = rest[['Heavily indebted poor countries (HIPC)','Low income','Lower middle income','Middle income','Upper middle income','High income']].reset_index()
incomes.sample(5)
Out[39]:
CountryName Year Heavily indebted poor countries (HIPC) Low income Lower middle income Middle income Upper middle income High income
24 1984 305097813.0 259338462.0 1.751864e+09 3.505455e+09 1.753590e+09 9.849705e+08
57 2017 770390212.0 643485701.0 3.225576e+09 5.679375e+09 2.453799e+09 1.224733e+09
37 1997 437721228.0 369731523.0 2.325019e+09 4.422451e+09 2.097433e+09 1.080857e+09
46 2006 563930944.0 476615219.0 2.727263e+09 4.986523e+09 2.259260e+09 1.144695e+09
5 1965 184949176.0 160529141.0 1.115860e+09 2.329439e+09 1.213579e+09 8.286284e+08
In [40]:
incomes_melted = pd.melt(incomes, id_vars='Year', var_name='Income Group', value_name='Population')
fig = px.line(incomes_melted, x='Year', y='Population', color='Income Group')
fig.update_layout(
    title='Population by Income Group over Time',
    xaxis_title='Year',
    yaxis_title='Population',
    legend_title='Income Group',
    font=dict(size=12),width = 1200,
    height=600)
fig.show()
incomes_10yrs = incomes[incomes['Year'].isin([1960, 1970, 1980, 1990, 2000, 2010, 2015, 2020, 2021])]
fig, ax = plt.subplots(figsize=(15, 8))
incomes_10yrs.plot.bar(x="Year", y=["Heavily indebted poor countries (HIPC)", "Low income", "Lower middle income", "Middle income", "Upper middle income", "High income"], ax=ax, rot=0, color=["#FF5733", "#FFFF66", "#00FF00", "#00CED1", "#800080", "#DC143C", "#008080"])
colors = ['#e6194b', '#f58231', '#ffe119', '#3cb44b', '#42d4f4', '#911eb4', '#4363d8']
for i, container in enumerate(ax.containers):
    for j, bar in enumerate(container):
        bar.set_color(colors[i])
legend_patches = [mpatches.Patch(color=colors[i], label=label) for i, label in enumerate(incomes_10yrs.columns[1:])]
plt.legend(handles=legend_patches)
plt.title("Population by Income Group in a Specific Year (1960-2021)\n")
plt.xlabel("Year")
plt.ylabel("Population")
plt.show()

regarding the mean value:¶

In [41]:
income_means = incomes_melted.groupby('Income Group')[['Population']].mean().sort_values('Population', ascending = False)
plt.figure(figsize=(12, 6))
sns.set_style("whitegrid")
sns.barplot(x='Population', y=income_means.index, data=income_means, palette='pastel')
plt.xlabel('Mean Population', fontsize=12, fontweight='bold')
plt.ylabel('Income Group', fontsize=12, fontweight='bold')
plt.title('Mean Population by Income Group\n', fontsize=16, fontweight='bold')
plt.xticks(fontsize=15, fontweight='bold')
plt.yticks(fontsize=15, fontweight='bold')
plt.show()
  • It can be concluded from the charts that among people with a middle income - the rate of growth is greater compared to those with a high income or alternatively - to those with a low income or poor areas.

Life Expectation (Present) :¶

In [42]:
life_exp = pd.read_csv(r"C:\Users\elon2\OneDrive\Desktop\programming\Data Analysis Projects\Data\life-expectancy.csv")
life_exp = life_exp.rename(columns = {'Entity':'CountryName','Code':'CountryCode','Life expectancy at birth (historical)':'lifeExp'})
life_exp21 = life_exp[life_exp.Year == 2021]
In [43]:
new_countries = countries.iloc[:,2:].set_index(['CountryName']).reset_index().iloc[1:,:][['CountryCode','CountryName',2021]].rename(columns = {2021:'Population2021'})
new_countries
Out[43]:
CountryCode CountryName Population2021
1 AFG Afghanistan 40099462
2 ALB Albania 2811666
3 DZA Algeria 44177969
4 ASM American Samoa 45035
5 AND Andorra 79034
... ... ... ...
212 VIR Virgin Islands (U.S.) 105870
213 PSE West Bank and Gaza 4922749
214 YEM Yemen, Rep. 32981641
215 ZMB Zambia 19473125
216 ZWE Zimbabwe 15993524

216 rows × 3 columns

In [44]:
merged1 = life_exp21.merge(new_countries, on = 'CountryCode').sort_values('lifeExp')[['CountryName_x','Population2021','lifeExp']].rename(columns = {'CountryName_x':'CountryName'})
merged1.sample(10)
Out[44]:
CountryName Population2021 lifeExp
84 Honduras 10278345 70.1
139 Nigeria 213401323 52.7
203 United Kingdom 67326569 80.7
174 Slovakia 5447247 74.9
24 Bosnia and Herzegovina 3270943 75.3
93 Isle of Man 84263 80.5
125 Monaco 36686 85.9
141 North Macedonia 2065092 73.8
175 Slovenia 2108079 80.7
70 Gambia 2639916 62.1
In [45]:
merged2 = density.merge(merged1, on ='CountryName')
merged2.sample(10)
Out[45]:
CountryName Code PeoplePerKM2 Region Population2021 lifeExp
124 Guatemala GTM 159.665416 North America 17109746 69.2
85 Morocco MAR 83.075474 Africa 37076584 74.0
105 Portugal PRT 112.713053 Europe 10325147 81.0
120 Isle of Man IMN 147.829825 Europe 84263 80.5
49 Lithuania LTU 44.720406 Europe 2800839 73.7
125 Andorra AND 168.157447 Europe 79034 80.4
46 Palau PLW 39.182609 Oceania 18024 66.0
128 Malawi MWI 210.964595 Africa 19889742 62.9
102 Northern Mariana Islands MNP 107.567391 Oceania 49481 77.2
112 United Arab Emirates ARE 131.866305 Asia 9365145 78.7
In [46]:
life_exp.groupby('CountryName')[['Year','CountryName','lifeExp']].min().Year.unique()
Out[46]:
array([1950, 1770, 1923, 1940, 1875, 1885, 1870, 1876, 1900, 1841, 1945,
       1931, 1831, 1930, 1899, 1895, 1775, 1927, 1920, 1897, 1755, 1816,
       1921, 1877, 1911, 1838, 1881, 1901, 1872, 1865, 1868, 1908, 1896,
       1924, 1893, 1926, 1850, 1948, 1846, 1946, 1938, 1932, 1882, 1751,
       1937, 1543, 1880], dtype=int64)
In [47]:
life_exp.groupby('CountryName')[['Year','CountryName','lifeExp']].max().Year.unique()
Out[47]:
array([2021], dtype=int64)
  • For each country, the highest life expectancy is at the present.
In [48]:
life_exp.CountryName.unique()[::10]
Out[48]:
array(['Afghanistan', 'Argentina', 'Barbados', 'Botswana', 'Cape Verde',
       'Costa Rica', 'Dominica', 'Europe', 'Georgia', 'Guernsey', 'India',
       'Jersey', 'Latin America and the Caribbean', 'Liechtenstein',
       'Mali', 'Monaco', 'Nepal', 'Northern America', 'Paraguay',
       'Rwanda', 'Sao Tome and Principe',
       'Small Island Developing States (SIDS)', 'Sweden', 'Tonga',
       'United Kingdom', 'Wallis and Futuna'], dtype=object)
for comparing between several countries - just enter the country name into the "selected_list" and run the next block:¶
In [49]:
selected_lst = ['Israel','Russia','Nicaragua','Chad','Guinea']
new_exp = life_exp[life_exp.CountryName.isin(selected_lst)]
fig = px.scatter(new_exp, x='Year', y='lifeExp',trendline='ols', hover_data=['CountryName'],color = 'CountryName')
fig.update_layout(title='Relationship between Time and Life Expectancy - Selected Countries',width=1050, height=600)
     
fig.update_xaxes(type='log')
fig.show()

Importing GDP per capita for each country (Present):¶

In [50]:
hadash = pd.read_csv(r"C:\Users\elon2\OneDrive\Desktop\programming\Data Analysis Projects\Data\GPD_per_capita_numeric.csv")
In [51]:
hadash21 = hadash[['Country Name','2021']].rename(columns = {'Country Name':'CountryName','2021':'GDP2021','lifeExp':'lifeExp2021'})
merged3 = hadash21.merge(merged2, on = 'CountryName').rename(columns = {'lifeExp':'lifeExp2021'})
merged3 = merged3.merge(comb1[['CountryName','Area']], on = 'CountryName')
merged3.sample(5)
Out[51]:
CountryName GDP2021 Code PeoplePerKM2 Region Population2021 lifeExp2021 Area
42 Djibouti 3150.436729 DJI 47.694435 Africa 1105557 62.3 23180.000
34 Colombia 6104.136709 COL 46.432233 South America 51516562 72.8 1109500.000
172 Vanuatu 2996.621062 VUT 26.180230 Oceania 319137 70.4 12190.000
43 Dominica 7653.171870 DMA 96.549333 North America 72412 72.8 750.000
150 South Sudan NaN SSD 17.008694 Africa 10748272 55.0 631928.125

Testing the relationship between general features:¶

In [52]:
plt.subplots(figsize=(14,10)) 
fig = sns.heatmap(merged3.corr(), annot = True, cmap = 'coolwarm')
fig.set_title("Correlation of GDP, general Density, Land-area, Life-Expectancy & Population \n", fontsize=16)
plt.show()

Connected Variables:¶

  • Population & Land-area (0.53) - slightly different than the same last calculation due to different data sample.
  • GDP & Life-expectation (0.62) - makes sense.
  • GDP & density - (0.59) considering the superficial calculation that was carried out (population size divided by area) - we will regard the density information conclusion with little suspicion (too general), but still, there is a connection.

¶

Displaying the correlations in a nicer way:¶

In [53]:
fig = px.scatter(merged3, x="GDP2021", y="lifeExp2021", color="Region",
                 hover_name="CountryName", size="Population2021", title="GDP per capita vs. Life expectancy (by Region) :",
                 size_max=75, log_x=True, log_y=False, width = 1100, height = 550)
fig.add_annotation(x=3, y=85, text="circles' size ---> Population", showarrow=False, font=dict(size=20))
fig.show()
In [54]:
fig = px.scatter(merged3, x="GDP2021", y="lifeExp2021", color="Region",
                 hover_name="CountryName", size="Area", title="GDP per capita vs. Life expectancy (by Region) :",
                 size_max=70, log_x=True, log_y=False,width = 1100, height = 550)
fig.add_annotation(x=3, y=88, text="circles' size ---> Land Area:", showarrow=False, font=dict(size=20))
fig.show()
In [55]:
fig = px.scatter(merged3, x="GDP2021", y="lifeExp2021", color="Region",
                 hover_name="CountryName", size="PeoplePerKM2", title="GDP per capita vs. Life expectancy (by Region) :",
                 size_max=80, log_x=True, log_y=False,width = 1100, height = 550)
fig.add_annotation(x=3.1, y=88, text="circles' size ---> General Density", showarrow=False, font=dict(size=20))
fig.show()

We can easily notice the connection between GDP and life-expectation.¶


Drag features to the X and Y axis for examine connections with pygwalker:¶

In [56]:
gwalker = pyg.walk(merged3)